Applications in Computer Vision
Algorithm 13 Training 1-bit detectors via LWS-Det.
Input: The training dataset, pre-trained teacher model.
Output: 1-bit detector.
1: Initialize $\alpha_i$ and $\beta_i^{o_k} \sim \mathcal{N}(0, 1)$ and other real-valued parameters layer-wise;
2: for $i = 1$ to $N$ do
3:   while Differentiable search do
4:     Compute $\mathcal{L}_i^{Ang}$, $\mathcal{L}_i^{Amp}$, $\mathcal{L}_i^{W}$
5:   end while
6: end for
7: Compute $\mathcal{L}_{GT}$, $\mathcal{L}_{Lim}$
8: for $i = N$ to $1$ do
9:   Update parameters via back propagation
10: end for
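The control flow of Algorithm 13 can be sketched as follows. This is only a schematic, assuming hypothetical shapes and a fixed step count in place of the search condition; the loss computations are stubbed out rather than implementing the actual LWS-Det objectives.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4             # number of 1-bit layers (illustrative)
SEARCH_STEPS = 3  # stand-in for the "while Differentiable search" condition

# Step 1: initialize scale factors alpha_i and architecture parameters
# beta_i ~ N(0, 1), layer-wise; one beta row per candidate weight in O = {w-, w+}.
alpha = [np.ones(8) for _ in range(N)]
beta = [rng.normal(size=(2, 8)) for _ in range(N)]

def layer_losses(i):
    """Placeholder for computing L_i^Ang, L_i^Amp, L_i^W for layer i."""
    return 0.0, 0.0, 0.0

# Steps 2-6: layer-wise differentiable search.
for i in range(N):
    for _ in range(SEARCH_STEPS):
        l_ang, l_amp, l_w = layer_losses(i)

# Step 7: detection and fine-grained feature losses on the whole model (stubbed).
l_gt, l_lim = 0.0, 0.0

# Steps 8-10: update parameters from the last layer back to the first.
for i in reversed(range(N)):
    pass  # a back-propagation update for layer i would go here
```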
We introduce the DARTS framework to solve Eq. 6.72, which we name differentiable binarization search (DBS). Following [151], we efficiently search for $w_i$. Specifically, we approximate $w_i$ by the weighted probability of two matrices whose weights are set to all $-1$ and all $+1$, respectively. We relax the choice of a particular weight by the probability function defined as
$$p_i^{o_k} = \frac{\exp(\beta_i^{o_k})}{\sum_{o'_k \in O} \exp(\beta_i^{o'_k})}, \quad \text{s.t. } O = \{w_i^-, w_i^+\}, \tag{6.73}$$
where $p_i^{o_k}$ is the probability matrix belonging to the operation $o_k \in O$. The search space $O$ is defined as the two possible weights $\{w_i^-, w_i^+\}$. At the inference stage, we select the weight with the maximum probability as
$$\hat{w}_{i,l} = \arg\max_{o_k} p_{i,l}^{o_k}, \tag{6.74}$$
where $p_{i,l}^{o_k}$ denotes the probability that the $l$-th weight of the $i$-th layer belongs to operation $o_k$. Therefore, the $l$-th weight of $\hat{w}$, that is, $\hat{w}_{i,l}$, is defined by the operation having the highest probability. In this way, we modify Eq. 6.87 by substituting $\hat{w}_i$ for $w_i$ as
$$\mathcal{L}_i^{Ang} = \left\| \frac{a_{i-1} \otimes w_i}{\|a_{i-1}\|_2 \|w_i\|_2} - \frac{a_{i-1} \odot \hat{w}_i}{\|a_{i-1}\|_2 \|\hat{w}_i\|_2} \right\|_2^2. \tag{6.75}$$
By doing so, we retain the top-1 (strongest) operation among the distinct weights for each element of $w_i$, drawn from the discrete set $\{+1, -1\}$.
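The relaxation of Eq. 6.73, the discretization of Eq. 6.74, and the angular loss of Eq. 6.75 can be sketched in numpy as below. This is an illustrative sketch, not the chapter's implementation: the function names are hypothetical, and the convolution $\otimes$ is simplified to a dot product.

```python
import numpy as np

def dbs_probabilities(beta_minus, beta_plus):
    """Softmax relaxation over the two candidate operations O = {w-, w+} (Eq. 6.73)."""
    betas = np.stack([beta_minus, beta_plus])          # shape (2, n_weights)
    betas = betas - betas.max(axis=0, keepdims=True)   # numerical stability
    e = np.exp(betas)
    return e / e.sum(axis=0, keepdims=True)            # p_i^{o_k}, columns sum to 1

def discretize(beta_minus, beta_plus):
    """Select the weight with maximum probability (Eq. 6.74): -1 or +1 per element."""
    p = dbs_probabilities(beta_minus, beta_plus)
    # row index 0 corresponds to w- = -1, row index 1 to w+ = +1
    return np.where(p.argmax(axis=0) == 1, 1.0, -1.0)

def angular_loss(a_prev, w_real, w_bin):
    """Squared difference of the two normalized responses (Eq. 6.75),
    with convolution simplified to a dot product for illustration."""
    t = (a_prev @ w_real) / (np.linalg.norm(a_prev) * np.linalg.norm(w_real))
    s = (a_prev @ w_bin) / (np.linalg.norm(a_prev) * np.linalg.norm(w_bin))
    return np.sum((t - s) ** 2)
```

Note that the loss vanishes when the binarized branch reproduces the direction of the real-valued response, which is exactly the angular alignment the search optimizes.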
6.4.4 Learning the Scale Factor
After searching for $w_i$, we learn the real-valued layers between the $i$-th and $(i+1)$-th 1-bit convolutions. We omit the batch normalization (BN) and activation layers for simplicity. We can directly simplify Eq. 6.69 as
$$\mathcal{L}_i^{Amp} = E_i(\alpha_i; w_i, \hat{w}_i, a_{i-1}, a_{i-1}). \tag{6.76}$$
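Eq. 6.76 casts learning the scale factor as minimizing a reconstruction error between the real-valued and 1-bit responses. A minimal sketch under simplifying assumptions: a single scalar $\alpha$, convolution replaced by a dot product, and a closed-form least-squares fit; `fit_scale` is a hypothetical helper, not from the chapter.

```python
import numpy as np

def fit_scale(a_prev, w_real, w_bin):
    """Closed-form least-squares alpha minimizing
    || a_prev @ w_real - alpha * (a_prev @ w_bin) ||_2^2."""
    t = a_prev @ w_real  # real-valued (teacher) response
    s = a_prev @ w_bin   # 1-bit (student) response
    return float(np.dot(t, s) / np.dot(s, s))
```

In practice the scale factor is learned jointly with the other real-valued parameters by gradient descent, but the closed-form fit above shows what the amplitude loss is pulling $\alpha_i$ toward.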
Following conventional BNNs [77, 287], we employ Eq. 6.80 to further supervise the scale factor $\alpha_i$. According to [235], we impose a fine-grained limitation on the features to aid detection. Hence, the supervision of LWS-Det is formulated as
$$\mathcal{L} = \mathcal{L}_{GT} + \lambda \mathcal{L}_{Lim} + \mu \sum_{i=1}^{N} \left( \mathcal{L}_i^{Ang} + \mathcal{L}_i^{Amp} \right) + \gamma \sum_{i=1}^{N} \mathcal{L}_i^{W}, \tag{6.77}$$